Practical - Week 1
Florencia Grattarola
(Department of Spatial Sciences)
2022-09-26
Data that can place a particular species in a particular place and time can take many forms.
Data can also be defined by how they were collected.
Finally, data can also be defined by how they are made available to others.
While disaggregated data can produce reliable results for a limited set of well-covered regions, aggregated data types can provide critical information for the extrapolation of biodiversity patterns into less well-sampled regions.
GBIF is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.
OBIS is a global open-access data and information clearing-house on marine biodiversity for science, conservation and sustainable development.
eBird’s goal is to gather birdwatchers’ knowledge and experience in the form of checklists of birds, archive it, and freely share it to power new data-driven approaches to science, conservation and education.
iNaturalist is one of the world’s most popular nature apps. It allows participants to contribute observations of any organism, or traces thereof, along with associated spatio-temporal metadata.
Map of Life endeavors to provide ‘best-possible’ species range information and species lists for any geographic area. The Map of Life assembles and integrates different sources of data describing species distributions worldwide.
IUCN’s (International Union for Conservation of Nature) Red List of Threatened Species has evolved to become the world’s most comprehensive information source on the global extinction risk status of animal, fungus and plant species.
rredlist: https://github.com/ropensci/rredlist
redlistr: https://github.com/red-list-ecosystem/redlistr
BIEN is a network of ecologists, botanists, and computer scientists working together to document global patterns of plant diversity, function and distribution.
SiBBr (Brazilian Biodiversity Information System) is an online platform that integrates data and information about biodiversity and ecosystems from different sources, making them accessible for different uses.
sibbr: https://github.com/sibbr
BBS (Breeding Bird Survey) involves thousands of volunteer birdwatchers carrying out standardised annual bird counts on randomly-located 1-km sites. It’s part of the NBN Atlas.
ALA (Atlas of Living Australia) is a collaborative, digital, open infrastructure that pulls together Australian biodiversity data from multiple sources, making it accessible and reusable.
galah: https://galah.ala.org.au
The open community around the Atlas of Living Australia platform.
BioTIME is an open-access global database of assemblage time series for quantifying and understanding biodiversity change.
BioTime Hub: https://github.com/bioTIMEHub
PREDICTS uses data on local biodiversity around the world to model how human activities affect biological communities. This biodiversity change is shown as the Biodiversity Intactness Index (BII).
Pick only one data source.
Open means anyone can freely access, use, modify, and share for any purpose.
Darwin Core is the internationally agreed data standard to facilitate the sharing of information about biological diversity.
countryCode: The standard code for the country in which the Location occurs. Recommended best practice is to use an ISO 3166-1-alpha-2 country code.
recordedBy: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence.
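To make the two terms concrete, here is a minimal sketch of a single record whose column names are Darwin Core terms. All values are made up for illustration, and " | " is assumed as the separator for recordedBy.

```r
library(tibble)

# One hypothetical occurrence record; the column names are Darwin Core terms
dwc_record <- tibble(
  scientificName   = "Lynx lynx (Linnaeus, 1758)",
  eventDate        = "2022-09-26",
  decimalLatitude  = 49.2,
  decimalLongitude = 16.5,
  countryCode      = "CZ",                   # ISO 3166-1-alpha-2 code
  recordedBy       = "Jane Doe | John Doe"   # concatenated list of recorders
)

dwc_record
```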
Open data are licensed under open licenses. Some examples:
CC0: Public domain
CC-BY: Attribution
CC-BY-NC: Attribution - Non Commercial
CC-BY-SA: Attribution - Share Alike
Data that are standardized and have an open licence can be shared :)
As an example, we will use the mammals of the Czech Republic. We will access the data through GBIF.
File > New project > New directory or Existing directory
We will use the tidyverse.
We will be using many functions from this collection of packages, like filter(), mutate(), and later read_csv().
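A minimal sketch of loading the tidyverse and trying two of these functions. The table `toy` and its values are hypothetical and only there so the example runs on its own.

```r
# install.packages("tidyverse")  # run once if it is not installed yet
library(tidyverse)

# A tiny made-up table, only to show filter() and mutate() in action
toy <- tibble(
  species = c("Lynx lynx", "Lutra lutra", "Lynx lynx"),
  year    = c(2019, 2021, 2022)
)

recent <- toy %>%
  filter(year >= 2020) %>%            # keep records from 2020 onwards
  mutate(genus = word(species, 1))    # first word of the species name

recent
```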
We will use rgbif.
To use it, we load the library and check it’s working.
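One possible way to do this check, assuming rgbif has been installed from CRAN:

```r
# install.packages("rgbif")  # run once if it is not installed yet
library(rgbif)

# If the library loaded without errors, this prints the installed version
packageVersion("rgbif")
```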
So, let’s get the taxon ID for the Mammalia class
$Mammalia
usagekey scientificname rank status matchtype
1 359 Mammalia class ACCEPTED EXACT
2 7423517 Mammamia Akkari, Stoev & Enghoff, 2011 genus ACCEPTED FUZZY
3 9522622 Mammaria genus ACCEPTED FUZZY
4 7688954 Mammaria Cesati ex Rabenhorst, 1854 genus ACCEPTED FUZZY
5 2573090 Mammaria Oken, 1815 genus SYNONYM FUZZY
6 6008010 Mammaria Müller, 1776 genus DOUBTFUL FUZZY
7 4899044 Mammalian Prions family ACCEPTED FUZZY
canonicalname confidence kingdom phylum kingdomkey phylumkey classkey
1 Mammalia 94 Animalia Chordata 1 44 359
2 Mammamia 74 Animalia Arthropoda 1 54 361
3 Mammaria 74 Fungi Ascomycota 5 95 320
4 Mammaria 74 Chromista Myzozoa 4 8770992 NA
5 Mammaria 73 Chromista Myzozoa 4 8770992 9049014
6 Mammaria 68 Animalia <NA> 1 NA NA
7 Mammalian 64 Viruses <NA> 8 NA NA
synonym class
1 FALSE Mammalia
2 FALSE Diplopoda
3 FALSE Sordariomycetes
4 FALSE <NA>
5 TRUE Dinophyceae
6 FALSE <NA>
7 FALSE <NA>
note
1 <NA>
2 Similarity: name=75; authorship=0; classification=-2; rank=0; status=1; score=74
3 Similarity: name=75; authorship=0; classification=-2; rank=0; status=1; score=74
4 Similarity: name=75; authorship=0; classification=-2; rank=0; status=1; score=74
5 Similarity: name=75; authorship=0; classification=-2; rank=0; status=0; score=73
6 Similarity: name=75; authorship=0; classification=-2; rank=0; status=-5; score=68
7 Similarity: name=75; authorship=0; classification=-12; rank=0; status=1; score=64
order family genus orderkey familykey genuskey
1 <NA> <NA> <NA> NA NA NA
2 Julida Julidae Mammamia 1019 4012 7423517
3 Sordariales Lasiosphaeriaceae Mammaria 1061 4162 9522622
4 <NA> <NA> Mammaria NA NA 7688954
5 Noctilucales Noctilucaceae Noctiluca 8808938 8267551 7443358
6 <NA> <NA> Mammaria NA NA 6008010
7 <NA> Mammalian <NA> NA 4899044 NA
acceptedusagekey
1 NA
2 NA
3 NA
4 NA
5 7443358
6 NA
7 NA
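The call that produced the listing above is not shown. As a sketch, a similar lookup can be done with rgbif's name_backbone(), which returns the best match in the GBIF backbone taxonomy. The object name `mammalia` is an assumption, and the column layout differs from the listing.

```r
library(rgbif)

# Match the name against the GBIF backbone, restricted to rank "class"
mammalia <- name_backbone(name = "Mammalia", rank = "class")

# The taxon ID (usageKey) to use in later queries; 359 in the listing above
mammalia$usageKey
```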
And now we can use the function occ_count() to find out the number of occurrence records for the entire Czech Republic.
How many occurrence records are in GBIF for the entire Czech Republic?
And how many records for the mammals of the Czech Republic?
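A sketch of both counts, assuming the country code "CZ" and the Mammalia taxon ID 359 found earlier. The object names are hypothetical, the counts change as GBIF grows, and the accepted arguments may differ between rgbif versions.

```r
library(rgbif)

# All occurrence records for the Czech Republic
n_cz <- occ_count(country = "CZ")

# Only the records of mammals (taxonKey 359) for the Czech Republic
n_mammals_cz <- occ_count(country = "CZ", taxonKey = 359)

n_cz
n_mammals_cz
```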
We are ready to do a download. Whoop!
To do this, we will use occ_search().
occ_search(
taxonKey = NULL,
scientificName = NULL,
country = NULL,
publishingCountry = NULL,
hasCoordinate = NULL,
typeStatus = NULL,
recordNumber = NULL,
lastInterpreted = NULL,
continent = NULL,
geometry = NULL,
geom_big = "asis",
geom_size = 40,
geom_n = 10,
recordedBy = NULL,
recordedByID = NULL,
identifiedByID = NULL,
basisOfRecord = NULL,
datasetKey = NULL,
eventDate = NULL,
catalogNumber = NULL,
year = NULL,
month = NULL,
decimalLatitude = NULL,
decimalLongitude = NULL,
elevation = NULL,
depth = NULL,
institutionCode = NULL,
collectionCode = NULL,
hasGeospatialIssue = NULL,
issue = NULL,
search = NULL,
mediaType = NULL,
subgenusKey = NULL,
repatriated = NULL,
phylumKey = NULL,
kingdomKey = NULL,
classKey = NULL,
orderKey = NULL,
familyKey = NULL,
genusKey = NULL,
establishmentMeans = NULL,
protocol = NULL,
license = NULL,
organismId = NULL,
publishingOrg = NULL,
stateProvince = NULL,
waterBody = NULL,
locality = NULL,
limit = 500,
start = 0,
fields = "all",
return = NULL,
facet = NULL,
facetMincount = NULL,
facetMultiselect = NULL,
skip_validate = TRUE,
curlopts = list(),
...
)
Get occurrence records of mammals from the Czech Republic.
Records found [6345]
Records returned [500]
No. unique hierarchies [39]
No. media records [500]
No. facets [0]
Args [occurrenceStatus=PRESENT, limit=500, offset=0, taxonKey=359, country=CZ,
fields=all]
# A tibble: 500 × 98
key scien…¹ decim…² decim…³ issues datas…⁴ publi…⁵ insta…⁶ hosti…⁷ publi…⁸
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 40115… Dama d… 49.2 16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
2 40116… Castor… 50.2 14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
3 40150… Myocas… 49.7 15.1 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
4 40181… Myocas… 50.1 14.4 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
5 40149… Sus sc… 49.2 16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
6 40149… Dama d… 49.2 16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
7 40149… Capreo… 49.6 16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
8 40149… Lepus … 49.6 16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
9 40149… Myocas… 50.1 14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
10 40148… Myocas… 49.8 14.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
# … with 490 more rows, 88 more variables: protocol <chr>, lastCrawled <chr>,
# lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
# occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
# classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
# speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
# kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>,
# species <chr>, genericName <chr>, specificEpithet <chr>, taxonRank <chr>, …
Check the data output. What’s the format? How many rows does it have?
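The Args line in the output above (taxonKey=359, country=CZ, limit=500) suggests a call like the following. The object name `mammals_search` is an assumption.

```r
library(rgbif)

# Search with the default limit of 500 records
mammals_search <- occ_search(taxonKey = 359, country = "CZ")

class(mammals_search)       # the type of object returned by occ_search()
names(mammals_search)       # its components; the records are in $data
nrow(mammals_search$data)   # number of rows actually returned
```

The records themselves are a tibble stored in the `data` element, so `mammals_search$data` is what we will work with.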
Get all occurrence records of mammals from the Czech Republic.
Mammals occurrence records from the Czech Republic
# A tibble: 6,000 × 179
key scien…¹ decim…² decim…³ issues datas…⁴ publi…⁵ insta…⁶ hosti…⁷ publi…⁸
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 40115… Dama d… 49.2 16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
2 40116… Castor… 50.2 14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
3 40150… Myocas… 49.7 15.1 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
4 40181… Myocas… 50.1 14.4 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
5 40149… Sus sc… 49.2 16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
6 40149… Dama d… 49.2 16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
7 40149… Capreo… 49.6 16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
8 40149… Lepus … 49.6 16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
9 40149… Myocas… 50.1 14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
10 40148… Myocas… 49.8 14.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US
# … with 5,990 more rows, 169 more variables: protocol <chr>,
# lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
# occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
# classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
# speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
# kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>,
# species <chr>, genericName <chr>, specificEpithet <chr>, taxonRank <chr>, …
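The call behind this table is not shown. A sketch that would return a table of this size, assuming the limit argument was raised to 6000 and the result was stored as mammalsCZ (the name used later in the practical):

```r
library(rgbif)

# Raise the limit above the default of 500 and keep only the records tibble
mammalsCZ <- occ_search(taxonKey = 359,
                        country  = "CZ",
                        limit    = 6000)$data

dim(mammalsCZ)   # rows x columns
```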
How many records do we have?
How many species do we have?
distinct() is used to see unique values
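A sketch of both counts on a small made-up table. `mammals_toy` is hypothetical; replace it with mammalsCZ to answer the questions for the real data.

```r
library(dplyr)

# Toy stand-in for mammalsCZ; the last record was not identified to species
mammals_toy <- tibble(
  species = c("Lynx lynx", "Lutra lutra", "Lynx lynx", NA)
)

# How many records?
n_records <- nrow(mammals_toy)

# How many species? Drop missing names, keep unique values, count the rows
n_species <- mammals_toy %>%
  filter(!is.na(species)) %>%
  distinct(species) %>%
  nrow()

n_records
n_species
```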
Data are not ‘good’ or ‘bad’; their quality depends on our goal.
Some things we can check:
CoordinateCleaner: https://github.com/ropensci/CoordinateCleaner
Automated flagging of common spatial and temporal errors in data.
As an example, we will check the following fields:
basisOfRecord: we want preserved specimens or observations.
taxonRank: we want records at species level.
coordinateUncertaintyInMeters: we want them to be smaller than 10 km.

basisOfRecord: we want preserved specimens or observations
distinct() is used to see unique values
group_by() is used to group values within a variable
Note the use of | (OR) to filter the data.
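The three steps for basisOfRecord can be sketched on a small made-up table. `mammals_toy` is hypothetical and stands in for mammalsCZ, and keeping only PRESERVED_SPECIMEN and HUMAN_OBSERVATION is one possible reading of "preserved specimens or observations".

```r
library(dplyr)

# Toy stand-in for mammalsCZ
mammals_toy <- tibble(
  species       = c("Lynx lynx", "Lutra lutra", "Lynx lynx", "Sus scrofa"),
  basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                    "FOSSIL_SPECIMEN", "HUMAN_OBSERVATION")
)

# Which values occur?
mammals_toy %>% distinct(basisOfRecord)

# How many records per value?
by_basis <- mammals_toy %>%
  group_by(basisOfRecord) %>%
  count()

# Keep a record if it is a preserved specimen OR a human observation
kept <- mammals_toy %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" |
           basisOfRecord == "HUMAN_OBSERVATION")
```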
taxonRank: we want records at species level

coordinateUncertaintyInMeters: we want them to be smaller than 10km
mammalsCZ %>%
  filter(coordinateUncertaintyInMeters > 1000) %>%
  select(scientificName, coordinateUncertaintyInMeters, stateProvince)
# A tibble: 505 × 3
scientificName coordinateUncertaintyInM…¹ state…²
<chr> <dbl> <chr>
1 Myotis nattereri (Kuhl, 1817) 26454 Středo…
2 Myotis myotis (Borkhausen, 1797) 26454 Středo…
3 Myotis myotis (Borkhausen, 1797) 26454 Středo…
4 Myotis myotis (Borkhausen, 1797) 26454 Středo…
5 Rhinolophus hipposideros (Bechstein, 1800) 26454 Středo…
6 Rhinolophus hipposideros (Bechstein, 1800) 26454 Středo…
7 Myotis myotis (Borkhausen, 1797) 26454 Středo…
8 Barbastella barbastellus (Schreber, 1774) 26454 Středo…
9 Barbastella barbastellus (Schreber, 1774) 26454 Středo…
10 Plecotus auritus (Linnaeus, 1758) 26454 Středo…
# … with 495 more rows, and abbreviated variable names
# ¹coordinateUncertaintyInMeters, ²stateProvince
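The listing above inspects the records with a large uncertainty. As a complementary sketch, the taxonRank check and a 10 km cut-off could be applied like this. `mammals_toy` and its values are made up, and treating 10 km as 10000 m is an assumption.

```r
library(dplyr)

# Toy stand-in for mammalsCZ
mammals_toy <- tibble(
  scientificName = c("Myotis myotis (Borkhausen, 1797)", "Myotis",
                     "Lynx lynx (Linnaeus, 1758)", "Sus scrofa Linnaeus, 1758"),
  taxonRank      = c("SPECIES", "GENUS", "SPECIES", "SPECIES"),
  coordinateUncertaintyInMeters = c(26454, 50, 300, NA)
)

# Which ranks occur?
mammals_toy %>% distinct(taxonRank)

cleaned <- mammals_toy %>%
  filter(taxonRank == "SPECIES") %>%                # species-level records only
  filter(coordinateUncertaintyInMeters < 10000)     # uncertainty under 10 km

cleaned
```

Note that filter() also drops the record with a missing uncertainty, because the comparison returns NA for it.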
coordinateUncertaintyInMeters: we want them to be smaller than 10km
We’ll get to this next week :)